Introduction
In this workshop we will be analysing pre processed RNA sequencing data, specifically reads from the raw tumour RNA-seq that do NOT map to the human reference genome (GRCh38), we will refer these as unmapped reads.
By charecterising the origin of these non-human unmapped reads we can identify poteintal bacterial or viral species which may be present in the tumour microenvinonemnt of this patient.
- For this workshop we will only be completing the steps highlighted in red, please refer to the RNA-seq workshops for alignments and quality control.
Below is an overview of the pipeline:
Tools
All the tools required to run this pipeline are listed below:
- FastQC - Version 0.11.3 | Website | User manual
- Trimmomatic - Version 0.36 | Website | User manual | Publication
- HISAT2 - Version 2.1.0 | Website | User manual | Publication
- Samtools - Version 1.5 | Website | User manual | Publication
- rnaSPAdes - Version 3.13.0 | Website | User manual | Publication
- MiniKraken2 - Version 2.0 Website | User Manual | Publication
- R - Version 3.4.3 | Website
- ggplot2 - Version 3.1.0 | Website | User manual
Workshop setup
First we need to setup the enviroment to analyse the data in the workshop. This analysis will be performed in an interactive session, which is setup as followed:
# Setting master paths for the whole workshop
#Don't forget to replace *your-username* with your ciw number
output_dir_master=/scratch/ciw/ciw30/metagenomics/Output
raw_rna_seq_data=/scratch/ciw/ciw30/metagenomics/Raw_input
#Do not change this one
resources=/scratch/ciw/ciw40/ToolsExtracting unmapped reads form a BAM file
In the Cancer_Microbiome_workshop folder you will find a BAM file labeled 'crc_patient.hisat.bam' this is an RNA-seq sample which has been pre-aligned to the human reference genome (GRCh38)
To extract unmapped reads from the BAM file we will use a package called Samtools:
outputdir=$output_dir_master/unmapped
mkdir -p $outputdir
#Extract the Unmapped reads from the input Bam file
bsub -J "samtools" -e Error_log/samtools_err.log -o Log/samtools_out.log -n 10 "$resources/samtools/samtools view -b -f 4 $raw_rna_seq_data/crc_patient.hisat.bam > $outputdir/unmapped.bam"We can use Samtools again to check to see how many reads we have left in our RNA-seq sample after we have extracted the unmapped reads and compare this to the initial number of reads we started with
- As you can see only a small fraction of the total number of reads we started with remain
#Count the total number of reads in our initial BAM file
outputdir=$output_dir_master/read_counts
mkdir -p $outputdir
bsub -J "samtools" -e Error_log/samtools_raw_count_err.log -o Log/samtools_raw_count_out.log -n 10 "$resources/samtools/samtools view -c $raw_rna_seq_data/crc_patient.hisat.bam > $outputdir/Raw_read_count.txt"
#Count the number of unmapped reads we have extracted
bsub -J "samtools" -e Error_log/samtools_unmapped_read_count_error.err -o Log/samtools_unmapped_read_count_out.log -n 5 "$resources/samtools/samtools view -c $output_dir_master/unmapped/unmapped.bam > $output_dir_master/read_counts/Unmapped_read_count.txt"Convert BAM file to FASTQ Format
Before we can proceed further to any downstream processing of our unmapped reads we must convert our BAM file to FASTQ Format
In this example we will be using the Bedtools package - BioBamBam is also a commonly used package for this step, especially if you are working with a mixed library
outputdir=$output_dir_master/bedtools
mkdir -p $outputdir
bsub -J "bedtools" -e Error_log/bedtools_error.err -o Log/bedtools_out.log -n 5 $resources/bedtools2/bin/bedtools bamtofastq -i $output_dir_master/unmapped/unmapped.bam -fq $outputdir/unmapped.fqDe novo assembly with SPAdes
SPAdes analyses the reads and firstly arranges them into contigs and scaffolds
Contigs are defined as regions which overlap between sequences that together represent a consensus sequence
Scaffolds consist of overlapping contigs separated by gaps of known length
To run de novo assembly we will be using a python package called rnaspades.py
#Make an output directory called 'spades'
outputdir=$output_dir_master/spades
mkdir -p $outputdir
#Running rnaspades.py
# -s parameter indicates our input file consists of single-end RNA-seq data if we are using data with paired-end reads however instead of the -s parameter we should use -1 for the forward strand -2 for the reverse strand
# -o indicates the output file we want to store to our results in
bsub -J "Spades" -e Error_log/spade_error.err -o Log/spade_out.log -n 10 python $resources/SPAdes/bin/rnaspades.py -s $output_dir_master/bedtools/unmapped.fq -o $outputdir --only-assemblerNow lets have a look at the output file and count the number of scaffolds SPAdes has identified from our sample:
# less allows you to read the contents of a file, to quit viewing this file just type the letter q and you will return to your terminal
less $output_dir_master/spades/transcripts.fastaYour output should look something like this:
Each sequence in the FASTA file has a header In this case: >NODE_1_length_3986_cov_17.47736 Each number in the notation of the header has a different meaning:
1 is the number of the contig/scaffold
3986 is the sequence length in nucleotides
17.47736 is the k-mer coverage for the last (largest) k value used. Note that the k-mer coverage is always lower than the read (per-base) coverage.
Next lets count the number of scaffolds SPAdes has identified from our sample:
# Count the number of contigs using grep
grep -e '>' $output_dir_master/spades/transcripts.fasta | wc -l Contig/Scaffold Classification with Kraken2
To charecterise the origin of our SPAdes output we will now run Kraken2
Kraken2 uses k-mers to assign a taxonomic labels in form of NCBI Taxonomy to the sequence. The taxonomic label is assigned based on similar k-mer content of the sequence in question to the k-mer content of reference genome sequence. The result is a classification of the sequence in question to the most likely taxonomic label. If the k-mer content is not similar to any genomic sequence in the database used, it will not assign any taxonomic label.
Kraken2 allows you to build several combinations of custom databases however in this example we will be using minikraken2
Minikraken2 is compiled by compressing the complete bacterial, archaeal, and viral RefSeq genomes from NCBI.
- Minikraken2 is a compressed database and contains only 2.7% of kmers from the original Kraken2 database
outputdir=$output_dir_master/kraken
mkdir -p $outputdir
minikraken=/scratch/ciw/ciw40/Resources/minikraken2_v1_8GB/
bsub -J "kraken2" -e Error_log/kraken2_error.err -o Log/kraken2_out.log "$resources/kraken2/kraken2 --use-names --db $minikraken $output_dir_master/spades/transcripts.fasta --report $outputdir/unmapped_reads.report.txt > $outputdir/unmapped_reads.kraken"Now lets look at the output :
# less allows you to read the contents of a file, to quit viewing this file just type the letter q and you will return to your terminal
less $outputdir/unmapped_reads.report.txtThe output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:
- Percentage of reads covered by the clade rooted at this taxon
- Number of reads covered by the clade rooted at this taxon
- Number of reads assigned directly to this taxon
- A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply “-“.
- NCBI Taxonomy ID
- The indented scientific name
Species and Genus Level Relative Abundance Estimations with Bracken
Bracken stands for Bayesian Re-estimation of Abundance with KrakEN, and is a statistical method that computes the abundance of species in DNA sequences from a metagenomics sample [LU,2017].
Bracken uses the taxonomy labels assigned by Kraken2 (see above) to estimate the number of reads originating from each species present in a sample.
Bracken classifies reads to the best matching location in the taxonomic tree, but does not estimate abundances of species.
Combined with the Kraken classifier, Bracken will produces more accurate species- and genus-level abundance estimates than Kraken2 alone.
#To run bracken we need to
outputdir=$output_dir_master/braken
mkdir -p $outputdir
bsub -J "bracken" -e Error_log/bracken_error.err -o Log/bracken_out.log $resources/Bracken/bracken -d $minikraken -i $output_dir_master/kraken/unmapped_reads.report.txt -l S -o $outputdir/unmapped.brackenData Visualisation with Krona
Krona tools is a package which allows us to create a nice interactive visualisation of all of the bacterial and viral genes we have identified and quanitifed from our colorectal cancer sample
First we need to prepare the input file for Krona, to do this we need to take column 1 and column 5 from our Kraken report file
outputdir=$output_dir_master/krona
mkdir -p $outputdir
cat $output_dir_master/kraken/unmapped_reads.kraken | cut -f 1,5 > $outputdir/unmapped.kraken.krona
bsub -J "Krona" -e Error_log/Krona_error.err -o Log/Krona_out.log $resources/KronaTools-2.7.1/scripts/ImportTaxonomy.pl $outputdir/unmapped.kraken.krona -o $outputdir/unmapped-krona.html